Abstract. A key data preparation step in Text Mining, Term Extraction selects
the terms, or collocations of words, attached to specific concepts. In this paper, the
task of extracting relevant collocations is achieved through a supervised learning
algorithm, exploiting a few collocations manually labelled as relevant/irrelevant.
The candidate terms are described along 13 standard statistical criteria.
From these examples, an evolutionary learning algorithm termed Roger, based
on the optimization of the Area under the ROC curve criterion, extracts an order
on the candidate terms. The robustness of the approach is demonstrated on two
real-world applications, considering different domains (biology and human
resources) and different languages (English and French).
1 Introduction

Besides the known difficulties of Data Mining, Text Mining presents specific
difficulties due to the structure of natural language. In particular, the
polysemy and synonymy effects are dealt with by constructing ontologies or
terminologies [Bourigault and Jacquemin, 1999], structuring the words and
their meanings in the domain application. A preliminary step for ontology
construction is to extract the terms, or word collocations, attached to the
concepts defined by the expert [Bourigault and Jacquemin, 1999] [Smadja, 1993].
Term Extraction actually involves two steps: the detection of the relevant
collocations, and their classification according to the concepts.

This paper focuses on the detection of relevant collocations, and presents a
learning algorithm for ranking collocations with respect to their relevance, in
the spirit of [Cohen et al., 1999]. An evolutionary algorithm termed Roger,
based on the optimization of the Receiver Operating Characteristics (ROC)
curve [Ferri et al., 2002] [Rosset, 2004], and already described in previous
works [Sebag et al., 2003a] [Sebag et al., 2003b], is applied to a few
collocations manually labelled as relevant/irrelevant by the expert. The
optimization of the ROC curve is directly related to the recall-precision
tradeoff in Term Extraction (TE).

The paper is organized as follows. Section 2 briefly reviews the main criteria
used in TE. Section 3 presents the Roger (ROc-based GEnetic learneR)
algorithm for the sake of self-containedness, and describes the bagging of
the diverse hypotheses constructed along independent runs. Sections 4 and
5 report on the experimental validation of the approach on two real-world
domain applications, and the paper ends with some perspectives for further
research.

2 Measures for Term Extraction 

The choice of a quality measure among the great many criteria used in Text
Mining (see e.g., [Daille et al., 1998] [Xu et al., 2002] [Roche et al., 2004b]) is
currently viewed as a decision making process; the expert has to find the
criterion most suited to his/her corpus and goals. The criteria considered in
the rest of the paper are:

• Mutual Information (MI) [Church and Hanks, 1990]
• Mutual Information with cube (MI³)
• Dice Coefficient (Dice) [Smadja et al., 1996]
• Log-likelihood (L) [Dunning, 1993]
• Number of occurrences + Log-likelihood (OccL)¹ [Roche et al., 2004a]
• Association Measure (Ass) [Jacquemin, 1997]
• Sebag-Schoenauer (SeSc) [Sebag and Schoenauer, 1988]
• J-measure (J) [Goodman and Smyth, 1988]
• Conviction (Conv) [Brin et al., 1997]
• Least contradiction (LC) [Azé and Kodratoff, 2004]
• Cote multiplier (CM) [Lallich and Teytaud, 2004]
• Khi2 test used in text mining (Khi2) [Manning and Schütze, 1999]
• T-test used in text mining (Ttest) [Manning and Schütze, 1999]
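
Among these criteria, OccL is the only composite one: collocations are ranked by their number of occurrences, and ties are broken by log-likelihood. A minimal sketch of such a ranking is given below; the function name, the tuple layout and the numeric values are hypothetical and only serve the illustration.

```python
def rank_by_occl(candidates):
    """Rank candidates given as (term, occurrences, log_likelihood) tuples.

    Sorting on the pair (occurrences, log_likelihood) in decreasing order
    ranks by occurrences first and uses log-likelihood only to break ties.
    """
    return sorted(candidates, key=lambda t: (t[1], t[2]), reverse=True)


# Hypothetical values, for illustration only.
candidates = [
    ("term a", 5, 2.0),
    ("term b", 7, 0.5),
    ("term c", 5, 9.1),
]
ranking = [term for term, _, _ in rank_by_occl(candidates)]
```

In this toy case "term b" comes first (most occurrences), and "term c" precedes "term a" because both occur 5 times and "term c" has the larger log-likelihood.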

Vivaldi et al. [Vivaldi et al., 2001] have shown that the search for a quality
measure can be formalized as a supervised learning problem. Considering a
training set, where each candidate term is described by its value for a
set of statistical criteria and labelled by the expert, they used Adaboost
[Schapire, 1999] to automatically construct a classifier.

The approach presented in the next section mostly differs from [Vivaldi et al., 2001]
as it learns an ordering function (term $t_1$ is more relevant than term $t_2$)
instead of a boolean function (term $t$ is relevant/irrelevant).


3 Learning ranking functions 

This section first briefly recalls the Roger (ROc-based GEnetic learneR)
algorithm, used for learning a ranking hypothesis and first described in
[Sebag et al., 2003b] [Sebag et al., 2003a]. The n’Roger variant used in this
paper involves two extensions: i) the use of non-linear ranking hypotheses;
ii) the exploitation of the ensemble of hypotheses learned along
independent runs of Roger. Using the standard notations, the dataset
$\mathcal{E} = \{(x_i, y_i),\ i = 1..n,\ x_i \in \mathbb{R}^d,\ y_i \in \{-1,+1\}\}$
includes $n$ collocation examples, where each collocation $x_i$ is described by
the value of $d$ statistical criteria, and its label $y_i$ denotes whether
collocation $x_i$ is relevant.

¹ OccL is defined by ranking collocations according to their number of occurrences,
and breaking the ties based on the term Log-likelihood.

3.1 Roger 

The learning criterion used in Roger is the Wilcoxon rank test, measuring
the probability that a hypothesis $h$ ranks $x_i$ higher than $x_j$ when $x_i$ is a
positive and $x_j$ is a negative example:

$$W(h) = \Pr(h(x_i) > h(x_j) \mid y_i > y_j) \qquad (1)$$

This criterion, with quadratic complexity in the number $n$ of examples², offers
an increased stability compared to the misclassification rate
($\Pr(h(x_i) \cdot y_i < 0)$, with linear complexity in $n$); see [Rosset, 2004]
and references therein.
The Wilcoxon rank test is equivalent to the area under the ROC (Receiver
Operating Characteristics) curve [Jin et al., 2003]. This curve, intensively
used in medical data analysis, shows the trade-off between the true positive
rate (the fraction of positive examples that are correctly classified, aka
recall) and the false positive rate (the fraction of negative examples that are
misclassified) achieved by a given hypothesis/classifier/learning algorithm.
Therefore, the area under the ROC curve (AUC) does not depend on the
imbalance of the training set [Kolcz et al., 2003], as opposed to other
measures such as the F-score [Caruana and Niculescu-Mizil, 2004]. The ROC curve
also shows the misclassification rates achieved depending on the error cost
coefficients [Domingos, 1999]. For these reasons, [Bradley, 1997] argues that
comparing the ROC curves attached to two learning algorithms is fairer and
more informative than comparing their misclassification rates only.
Accordingly, the area under the ROC curve defines a new learning criterion,
used e.g. for the evolutionary optimization of neural nets [Fogel et al., 1995],
or the greedy search of decision trees [Ferri et al., 2002].
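
The criterion of Eq. 1 can be computed either by enumerating all (positive, negative) pairs or from the sum of the ranks of the positive examples. The sketch below shows both computations on a toy example; the function names are hypothetical, and counting ties as one half is a common convention assumed here, not something stated in the paper.

```python
from itertools import product


def wilcoxon_pairs(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; O(n^2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    # Assumption: a tie between a positive and a negative counts for 1/2.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p, q in product(pos, neg))
    return wins / (len(pos) * len(neg))


def wilcoxon_ranks(scores, labels):
    """Same quantity from the rank sum of the positives; O(n log n)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the block of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the block
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


scores = [0.9, 0.8, 0.7, 0.6, 0.2]
labels = [1, -1, 1, 1, -1]
w_pairs = wilcoxon_pairs(scores, labels)   # 4 of the 6 pairs are well ordered
w_ranks = wilcoxon_ranks(scores, labels)   # same value from the rank sum
```

Both functions return 4/6 on this example, since the positives scored 0.7 and 0.6 are ranked below the negative scored 0.8.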

In an earlier step [Sebag et al., 2003b], the search space $\mathcal{H}$ considered is
that of linear hypotheses ($\mathcal{H} = \mathbb{R}^d$). To each vector $w$ in
$\mathbb{R}^d$ is attached hypothesis $h_w$ with $h_w(x) = \langle w, x \rangle$,
where $\langle w, x \rangle$ denotes the scalar product of $w$ and $x$.
Hypothesis $h_w$ defines an order on $\mathbb{R}^d$, which is evaluated
from the Wilcoxon rank test on the training set $\mathcal{E}$ (Eq. 1), measured after
cross-validation.

² Actually, the computational complexity is in $O(n \log n)$ since $W(h)$ is
proportional to the sum of ranks of the positive examples.

The combinatorial optimization problem defined by Eq. 1, thus mapped
onto a numerical optimization problem, is tackled by Evolution Strategies
(ES). ES are the Evolutionary Computation algorithms that are best suited
to parameter optimization; the interested reader is referred to [Bäck, 1995]
for an extensive presentation. In the rest of the paper, Roger employs a
($\mu + \lambda$)-ES, involving the generation of $\lambda$ offspring from $\mu$
parents through uniform crossover and self-adaptive mutation, and
deterministically selecting the next $\mu$ parents as the best among the
$\mu$ parents and the $\lambda$ offspring.
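
To make the ($\mu + \lambda$) scheme concrete, the following sketch evolves linear hypotheses against the Wilcoxon criterion. It is not the Roger implementation: the single global step size per individual, the learning rate `tau`, the initialisation and the tie handling are assumptions chosen for brevity, and all names are hypothetical.

```python
import math
import random


def auc(w, data):
    """Wilcoxon criterion of the linear hypothesis h_w(x) = <w, x>."""
    pos = [sum(a * b for a, b in zip(w, x)) for x, y in data if y == 1]
    neg = [sum(a * b for a, b in zip(w, x)) for x, y in data if y == -1]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def evolve(data, d, mu=20, lam=100, generations=30, p_cross=0.6, seed=0):
    """(mu + lambda)-ES; returns the best (weights, sigma, fitness) found."""
    rng = random.Random(seed)
    tau = 1.0 / math.sqrt(d)  # assumed learning rate of the step size

    def make(w, sigma):
        return (w, sigma, auc(w, data))

    parents = [make([rng.gauss(0, 1) for _ in range(d)], 1.0)
               for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            a, b = rng.sample(parents, 2)
            if rng.random() < p_cross:
                # Uniform crossover: each coordinate comes from either parent.
                w = [x if rng.random() < 0.5 else y
                     for x, y in zip(a[0], b[0])]
            else:
                w = list(a[0])
            # Self-adaptive mutation: the step size is mutated first
            # (log-normal rule), then used to perturb the weights.
            sigma = a[1] * math.exp(tau * rng.gauss(0, 1))
            w = [x + sigma * rng.gauss(0, 1) for x in w]
            offspring.append(make(w, sigma))
        # Plus selection: keep the mu best among parents and offspring.
        parents = sorted(parents + offspring,
                         key=lambda ind: ind[2], reverse=True)[:mu]
    return parents[0]


# Toy data (hypothetical): the first feature tends to be larger for positives.
data = [((1.0, 0.3), 1), ((0.9, 0.8), 1), ((0.8, 0.1), 1),
        ((0.2, 0.5), -1), ((0.1, 0.9), -1), ((0.3, 0.2), -1)]
best_w, best_sigma, best_auc = evolve(data, d=2, generations=5, seed=1)
```

Because selection is made among parents and offspring together, the best fitness in the population never decreases from one generation to the next.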


3.2 Extensions 

An extension first presented in [Jong et al., 2004] concerns the use of
non-linear hypotheses. Exploiting the flexibility of Evolutionary Computation,
the search space $\mathcal{H}$ is set to $\mathbb{R}^d \times \mathbb{R}^d$; each
hypothesis $h = (w, c)$, composed of a weight vector $w$ and a center $c$,
associates to $x$ the weighted $L_1$-distance of $x$ and $c$:

$$h(x = (x_1, \ldots, x_d)) = \sum_{i=1}^{d} w_i \, |x_i - c_i|$$

It must be noted that this representation allows Roger to search (a limited
kind of) non-linear hypotheses, by (only) doubling the size of the linear
search space. Previous work has shown that non-linear Roger significantly
outperforms linear Roger for some text mining applications [Roche et al., 2004a].
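
The difference between the two hypothesis spaces can be illustrated on a single feature. In the sketch below (function names and values are hypothetical), a negative weight combined with a mid-range center penalises values that are either too low or too high, which no linear hypothesis on the same feature can express.

```python
def h_linear(w, x):
    """Linear hypothesis: scalar product <w, x>."""
    return sum(wi * xi for wi, xi in zip(w, x))


def h_nonlinear(w, c, x):
    """Non-linear hypothesis: weighted L1 distance between x and center c."""
    return sum(wi * abs(xi - ci) for wi, xi, ci in zip(w, x, c))


# One feature, center 0.5, negative weight: the score is highest at the center
# and decreases symmetrically on both sides.
w, c = [-1.0], [0.5]
low = h_nonlinear(w, c, [0.25])
mid = h_nonlinear(w, c, [0.5])
high = h_nonlinear(w, c, [0.75])
```

Here `low` and `high` are both equal to -0.25 while `mid` is 0, so the example at the center is ranked first; a linear score on this feature is monotone and would necessarily rank one of the two extremes first.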

A new extension, inspired from ensemble learning [Breiman, 1998],
exploits the hypotheses $h_1, \ldots, h_T$ learned along $T$ independent runs of Roger.

The aggregation of the (normalised) $h_i$, referred to as $H$, associates to each
example $x$ the median value of $\{h_1(x), \ldots, h_T(x)\}$.
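
A minimal sketch of this aggregation follows. The paper does not specify how the hypotheses are normalised; min-max scaling of each hypothesis over the examples at hand is an assumption made here, and the function names are hypothetical.

```python
from statistics import median


def normalise(scores):
    """Min-max scaling to [0, 1] (assumed normalisation)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)  # constant hypothesis: no ordering information
    return [(s - lo) / (hi - lo) for s in scores]


def aggregate(score_lists):
    """score_lists[t][k] holds h_t(x_k); returns H(x_k) for every example k."""
    normed = [normalise(scores) for scores in score_lists]
    return [median(column) for column in zip(*normed)]


# Scores of T = 3 hypothetical runs on 3 examples.
runs = [
    [0.0, 5.0, 10.0],
    [1.0, 2.0, 3.0],
    [10.0, 0.0, 20.0],
]
H = aggregate(runs)
```

On this toy input the third run disagrees with the other two on the first two examples, and the median keeps the majority ordering: `H` is `[0.0, 0.5, 1.0]`.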


4 Goals of Experiments and Experimental Setting 

The goal of experiments is twofold. On one hand, the ranking efficiency
of n’Roger will be assessed and compared to that of state-of-the-art
supervised learning algorithms, specifically Support Vector Machines with
linear, quadratic and Gaussian kernels, using the SVMTorch implementation³ with
default options. Due to space limitations, only ensemble-based non-linear
Roger, termed n’Roger, will be considered.

³ http://www.idiap.ch/machine_learning.php?content=Torch/en_OldSVMTorch.txt

On the other hand, the results provided by n’Roger will be interpreted
and discussed with respect to their intelligibility. The experimental setting
is as follows. An experiment is a 5-fold stratified cross-validation process;
on each fold, i) SVM learns a hypothesis $h_{SVM}$; ii) Roger is launched 21
times, and the bagging of the 21 learned hypotheses constitutes the
hypothesis $h_{n'Roger}$ learned by n’Roger; iii) both hypotheses are evaluated on the
fold test set and the associated ROC curve (True Positive Rate vs False
Positive Rate) is constructed. The AUC values are averaged over the 5 folds.
The overall results reported in the next section are averaged over 10
experiments (10 different splits of the dataset into 5 folds).
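
The protocol (10 experiments, each a 5-fold stratified cross-validation) can be sketched as follows. The fold construction and the `evaluate` callback are hypothetical stand-ins: in the paper's setting, `evaluate` would train a learner on the training indices and return its AUC on the test indices.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split example indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled examples of this class to the folds in turn.
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds


def protocol(labels, evaluate, k=5, n_experiments=10):
    """Average evaluate(train, test) over the k folds, then over experiments."""
    per_experiment = []
    for e in range(n_experiments):
        folds = stratified_folds(labels, k, seed=e)  # a different split each time
        values = []
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            values.append(evaluate(train, test))
        per_experiment.append(sum(values) / k)
    return sum(per_experiment) / n_experiments


# Toy check with 20 positive and 10 negative examples.
labels = [1] * 20 + [-1] * 10
folds = stratified_folds(labels)
share = protocol(labels, lambda train, test: len(test) / len(labels))
```

With 20 positives and 10 negatives, every fold receives 4 positives and 2 negatives, so the class proportions of the full set are preserved in each test fold.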

The Roger parameters are as follows: $\mu = 20$; $\lambda = 100$; the
self-adaptive mutation rate is 1.; the uniform crossover rate is .6.

5 Empirical validation 

After describing the datasets, this section reports on the comparative
performances of the algorithms, and inspects the results actually provided by
n’Roger.

5.1 Datasets 

In both domains, the data preparation step [Roche et al., 2004b] allows for
categorizing the word collocations depending on the grammatical tag of the
words (e.g. Adjective, Noun).

A first corpus related to Molecular Biology involves 6119 paper abstracts
in English (9.4 MB) gathered from queries on Medline⁴. The 1028 Noun-Noun
collocations occurring more than 4 times are labelled by the expert;
the dataset includes a huge majority of relevant collocations (Table 1).

A second corpus related to Curriculum Vitae⁵ involves 582 CVs in French
(952 KB). The “Frequent CV” dataset includes the 376 Noun-Adjective
collocations with at least 3 occurrences (two hours labelling required), with a huge
majority of relevant collocations. The “Infrequent CV” dataset includes the
2822 Noun-Adjective collocations occurring once or twice (two days labelling
required), with a significantly different distribution of relevant/irrelevant
collocations (Table 1). Examples of relevant vs irrelevant collocations are
respectively compétences informatiques (computer skills) and euros annuels
(annual euros); although both collocations make sense, only the first one
conveys useful information for the management of human resources.


Collocations                   # collocations   Relevant   Irrelevant
Molecular Biology              1028             90.9%      9.1%
CV, Frequent collocations      376              85.7%      14.3%
CV, Infrequent collocations    2822             56.6%      43.4%

Table 1. Relevant and irrelevant collocations.


5.2 Ranking accuracy 

Following the experimental setting described in Section 4, Table 2 compares the
average AUC achieved for n’Roger and SVMTorch with linear, Gaussian
and quadratic kernels. On these domain applications, both supervised
learning approaches significantly improve on the statistical criteria standalone
(Table 3). Further, n’Roger improves significantly on SVM using any
kernel, except on the Infrequent CV dataset. A tentative interpretation for
this result is based on the fact that this dataset is the most balanced one;
SVM has some difficulties in coping with imbalanced datasets.

⁴ http://www.ncbi.nlm.nih.gov/entrez/query.fcgi

⁵ Courtesy of the VediorBis Foundation.


                          n’Roger         SVM (~ 1.5s/fold)
Corpus                    (~ 17s/fold)    Linear         Gaussian       Quadratic
Molecular Biology (MB)    0.73 ± 0.05     0.50 ± 0.08    0.46 ± 0.08    0.59 ± 0.08
Frequent CV (F-CV)        0.64 ± 0.08     0.48 ± 0.08    0.48 ± 0.08    0.50 ± 0.10
Infrequent CV (I-CV)      0.73 ± 0.01     0.72 ± 0.01    0.72 ± 0.02    0.71 ± 0.02

Table 2. Ranking accuracy (Area under the ROC curve) of learning algorithms.


Corpus  MI    MI³   Dice  L     OccL  Ass   J     Conv  SeSc  CM    LC    Ttest Khi2
MB      0.30  0.35  0.31  0.42  0.57  0.31  0.59  0.35  0.43  0.31  0.46  0.31  0.31
F-CV    0.31  0.40  0.39  0.43  0.58  0.32  0.58  0.39  0.40  0.31  0.44  0.36  0.36
I-CV    0.29  0.30  0.33  0.30  0.37  0.29  0.50  0.40  0.39  0.30  0.45  0.30  0.30

Table 3. Ranking accuracy (Area under the ROC curve) of statistical criteria.


A more detailed picture is provided by Fig. 1, showing the ROC curve
associated to SVM, n’Roger and the OccL and J measures on the Frequent
CV dataset on a representative fold (termed RF in this paper). Interestingly,
the major differences between n’Roger and the other measures are seen at
the beginning of the curve, i.e. they concern the top ranked collocations.
Typically, a recall (True Positive Rate) of 50% is obtained for 18% false
positives with n’Roger, against 23% with OccL, 31% with the J measure and
68% for quadratic SVM⁶.
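
Such figures are read off the ROC curve: the ranking is scanned from the top until the target recall is reached, and the fraction of negative examples met so far is reported. A minimal sketch, with hypothetical names and toy values, and ignoring ties between scores (an assumption), is:

```python
def fpr_at_recall(scores, labels, recall=0.5):
    """False positive rate at the first rank where the recall target is met."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y == 1)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        if tp / n_pos >= recall:
            return fp / n_neg
    return 1.0


# Toy ranking (hypothetical scores): 4 positives and 2 negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, -1, 1, -1, 1, 1]
fpr = fpr_at_recall(scores, labels, recall=0.5)
```

In this toy ranking the second positive is reached after one of the two negatives has been seen, so a recall of 50% costs a false positive rate of 0.5.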

In summary, n’Roger improves the accuracy of the top-ranked collocations,
and therefore the satisfaction and productivity of the expert if he/she
only examines the top results. A proof of principle of the generality of the
approach has been presented in [Roche et al., 2004b], as the ranking
function learned from one corpus, in one language, was found to outperform the
standard statistical criteria when applied on the other corpus, in another
language.

5.3 Analysis of a ranking function 

As shown in [Jong et al., 2004], the weights associated to distinct features by
Roger can provide some insights into the relevance of the features.
Accordingly, the hypotheses constructed by n’Roger are examined.

Fig. 2 displays the weights and center coordinates of all 13 features
(Section 2) for a representative Roger hypothesis $h$ (closest to the ensemble
n’Roger hypothesis $H$) learned on a fold of the Frequent CV dataset.
Although AUC($h$) is lower than that of $H$ (.61 vs .64), it still surpasses that of
standalone features (statistical criteria).

⁶ The SVM ROC curve is not significant, as its AUC is lower than .5 on this test fold.

Fig. 1. ROC Curves on Frequent Collocations of CV corpus (for the test set
of RF).

Fig. 2. Weights ($w_i$, $c_i$) on the frequent CVs (for the learn set of RF).

As could have been expected, Roger detects that the mutual information
(MI) criterion does badly (AUC(MI) = .31, Table 3), with high center
$c_{MI}$ and weight $w_{MI}$ values (collocations with high MI are less
relevant, everything else being equal). Inversely, as the OccL criterion does well
(AUC(OccL) = .58), the center $c_{OccL}$ is high, associated with a highly
negative weight $w_{OccL}$ (collocations with low OccL are less relevant, everything
else being equal) (see Table 4).

Although these tendencies could have been exploited by linear hypotheses,
this is no longer the case for the J criterion (AUC(J) = .58): interestingly,
the center $c_J$ takes on a medium value, with a high negative weight $w_J$. This
is interpreted as follows: collocations with either too low or too high values
of J are less relevant, everything else being equal. The current limitation of the
approach is to provide a “conjunctive” description of the region of relevant
collocations⁷.


MI

$w_{MI}$ = 0.68
$c_{MI}$ = 0.59


OccL

$w_{OccL}$ = -0.41
L. 


n’Roger 


Collocation 


Rank 


Rank 


Rank 


experience commerciale 
formation informatique 
societe informatique 
gestion informatique 


297 
300 

298 

299 


258 

123 

299 

76 


colonne morris 
bouygue telecom 
fromagerie riches-mont 
sauveteur secouriste 


211 

213 

212 

151 


90 

298 

297 

296 


experience professionelle 
ressource humaine 
baccalaureat professionel 
baccalaureat scientifique 


146 

44 

193 

148 


300 

299 

22 

58 


Table 4. Rank of relevant collocations given by 2 measures (MI and OccL) and
n’Roger. For each measure the weights ($w_i$, $c_i$) used by n’Roger are given (on
the learn set of RF).


6 Discussion and Perspectives 

The main claim of the paper is that supervised learning can significantly
contribute to the Term Extraction task in Text Mining. Some empirical evidence
supporting this claim has been presented, related to two corpora with different
domain applications and languages. Based on a domain- and language-independent
description of the collocations along a set of standard statistical
criteria, and on a few collocations manually labelled as relevant/irrelevant by
the expert, a ranking hypothesis is learned.

The ranking learner n’Roger used in the experiments is based on the
optimization of the combinatorial Wilcoxon rank test criterion, using an
evolutionary computation algorithm. Two new features, the use of non-linear
hypotheses and the exploitation of the ensemble of hypotheses learned along
independent runs of Roger, have been exploited in n’Roger.

Further research is concerned with enriching the description of collocations,
e.g. adding typography-related indications (e.g. distance to the closest
typographic signs) or distance to the closest Noun, possibly providing
additional cues on the role of relevant collocations. Another perspective is to
extend Roger using multi-modal and multi-objective evolutionary optimization
[Deb, 2001], e.g. making it possible to characterize several types of relevant
collocations in a single run. A long-term goal is to study, along a variety of domain
applications and expert goals, the possible regularities associated to i) the
(domain and language independent) description of the relevant collocations;
ii) the ranking hypotheses.

⁷ In the sense that a single center $c$ is considered, though the condition far from
$c_i$ actually corresponds to a disjunction.